
init main repo structure and demonstrate the AR + DiT demo for omni models #6

Merged
hsliuustc0106 merged 20 commits into main from hsliu-dev-C on Sep 30, 2025

Conversation

@hsliuustc0106 (Collaborator) commented on Sep 25, 2025

  • Add comprehensive PRD, architecture design, and test design documents
  • Implement core modules: OmniLLM, AsyncOmniLLM, stage configurations
  • Implement CLI integration with --omni flag support
  • Update dependencies to vLLM 0.10.2 and PyTorch 2.8.0
  • Implement stage-based processing architecture
  • Add multimodal output processing capabilities
  • Support vllm serve --omni for 1) AR models only and 2) AR + DiT models

Test Plan:

We test the following scenarios:

  1. Model loading and server startup
  2. Health and info endpoints
  3. Text generation functionality
  4. Performance metrics
  5. API client integration
  6. AR → DiT diffusers pipeline
bash scripts/test_serving.sh
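For reference, a minimal Python sketch of the health and generation checks the script exercises (scenarios 2 and 3 above); it assumes the server is already listening on port 8000, and the request field names ("prompt", "max_tokens") are assumptions rather than the confirmed schema:

import requests

BASE_URL = "http://localhost:8000"

# Scenario 2: the health endpoint should report the service as healthy.
health = requests.get(f"{BASE_URL}/health", timeout=5).json()
assert health.get("status") == "healthy", health

# Scenario 3: a simple text-generation request against /generate.
# The "prompt" and "max_tokens" field names are assumed for illustration.
response = requests.post(
    f"{BASE_URL}/generate",
    json={"prompt": "This is a command line test case.", "max_tokens": 32},
    timeout=60,
)
response.raise_for_status()
print(response.json())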

Test Results:

[SUCCESS] Import test passed
[INFO] Starting vLLM-omni server on port 8000...
[INFO] Command: vllm serve ./models/Qwen3-0.6B --omni --port 8000
[INFO] Server started with PID: 64101
[INFO] Waiting 15 seconds for server to initialize...
[SUCCESS] Server appears to be running
[INFO] Testing health endpoint...
[SUCCESS] Health check passed: {"status":"healthy","service":"vllm-omni"}
[INFO] Testing text generation...
[SUCCESS] Text generation test passed
Generated text: in a test case. This is a command line test case. The test case will run the server
[INFO] Running performance test...
[SUCCESS] Performance test passed (1s response time)
[INFO] Testing API client example...
[SUCCESS] API client test passed
[INFO] Testing AR → DiT diffusers pipeline example...
[SUCCESS] AR → DiT pipeline test passed (output: logs/ar_dit_pipeline.png)
[SUCCESS] All tests completed successfully!

==========================================


Note

Introduce vLLM-omni multi-stage (AR→DiT) pipeline with CLI vllm --omni, FastAPI server, diffusers-backed diffusion, configs/output processing, examples, tests, and scripts; update dependencies.

  • Core/Architecture:
    • Add multi-stage framework: OmniLLM/AsyncOmniLLM, StageManager, OmniRequest, and implementation docs.
    • Introduce configs: OmniStageConfig, DiTConfig, DiTCacheConfig (+ helpers) and remove legacy dit_cache_interface.
    • Add multimodal output processing (engine/output_processor.py).
  • Diffusion (DiT):
    • Add diffusers-backed pipeline: engine/diffusion_engine.py, worker/gpu_diffusion_model_runner.py, worker/gpu_diffusion_worker.py.
    • Add DiT cache manager (core/dit_cache_manager.py).
  • CLI & Server:
    • New CLI entrypoint intercepting vllm serve --omni (entrypoints/cli/*, pyproject scripts); see the sketch after this list.
    • FastAPI server with /generate, /health, /info (entrypoints/api_server.py).
  • Examples & Scripts:
    • Add basic examples and AR→DiT diffusers example with YAML config (examples/*).
    • Add comprehensive serving test script and README (scripts/test_serving.sh).
  • Tests:
    • Add unit tests for configs and shared pytest fixtures (tests/*).
  • Dependencies:
    • Bump to vllm>=0.10.2, torch>=2.7; add PyYAML; expose new CLI scripts in pyproject.toml and expand requirements.txt dev tools.
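As referenced in the CLI bullet above, a rough sketch of how an entrypoint can intercept vllm serve --omni before deferring to stock vLLM; this is a hypothetical illustration rather than the PR's actual implementation, and the run_server import is assumed:

import sys

def main() -> None:
    """Dispatch: route `vllm serve --omni` to the omni server, else stock vLLM."""
    if "--omni" in sys.argv:
        argv = [arg for arg in sys.argv[1:] if arg != "--omni"]
        # Hypothetical omni entrypoint; the real function name may differ.
        from vllm_omni.entrypoints.api_server import run_server
        run_server(argv)
    else:
        # Fall through to vLLM's regular CLI; this module path matches recent
        # vLLM releases but may differ across versions.
        from vllm.entrypoints.cli.main import main as vllm_main
        vllm_main()

if __name__ == "__main__":
    main()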

Written by Cursor Bugbot for commit 05d2367. This will update automatically on new commits.

- Add comprehensive PRD, architecture design, and test design documents
- Implement core modules: OmniLLM, AsyncOmniLLM, stage configurations
- Add DiT scheduler and cache manager for diffusion models
- Implement CLI integration with --omni flag support
- Add API server and plugin system for vLLM integration
- Create comprehensive test suite with fixtures
- Update dependencies to vLLM 0.10.2 and PyTorch 2.8.0
- Add conda environment setup and package installation
- Implement stage-based processing architecture
- Add multimodal output processing capabilities

This commit establishes the foundation for multi-modal model inference and serving with non-autoregressive structures; a hypothetical configuration sketch follows.
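To make the stage-based architecture concrete, a hypothetical sketch of wiring an AR stage and a DiT stage together; the module paths, class names, and the input/output modality fields come from this PR, but the remaining constructor arguments (model, stage_configs) and the generate signature are assumptions:

from vllm_omni.config.stage_config import OmniStageConfig
from vllm_omni.entrypoints.omni_llm import OmniLLM

# Stage 1: autoregressive text model (the model used in the test plan above).
ar_stage = OmniStageConfig(
    model="./models/Qwen3-0.6B",          # assumed constructor argument
    input_modalities=["text"],
    output_modalities=["text"],
)

# Stage 2: diffusers-backed DiT stage consuming the AR stage's text output.
dit_stage = OmniStageConfig(
    model="stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model id
    input_modalities=["text"],
    output_modalities=["image"],
)

# OmniLLM runs the stages sequentially: AR text generation, then DiT image
# synthesis; the constructor and generate signatures are assumptions.
llm = OmniLLM(stage_configs=[ar_stage, dit_stage])
outputs = llm.generate("A watercolor painting of a mountain lake")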
@gemini-code-assist

Summary of Changes

Hello @hsliuustc0106, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request lays the complete groundwork for vLLM-omni, an ambitious extension designed to transform vLLM into a versatile platform for multi-modal and non-autoregressive model inference. It encompasses the entire project setup, from foundational documentation and architectural blueprints to the initial implementation of core processing logic, specialized components for diffusion models, and user-facing interfaces. The changes establish a modular and extensible system capable of orchestrating complex multi-stage AI pipelines, significantly broadening vLLM's capabilities beyond traditional text-based autoregressive generation.

Highlights

  • Initial Repository Structure: This pull request establishes the foundational directory and file structure for the new vLLM-omni project.
  • Core Multi-modal Extension: Introduces vLLM-omni as an extension to vLLM, designed to support multi-modal model inference and serving, including non-autoregressive structures and non-textual outputs.
  • Comprehensive Documentation: Adds extensive documentation, including a Product Requirements Document (PRD), detailed high-level and implementation architecture designs, API design templates, and a comprehensive test design document.
  • Multi-Stage Processing Framework: Implements core classes like OmniLLM and AsyncOmniLLM to enable sequential, multi-stage processing of requests, integrating both autoregressive (AR) and Diffusion Transformer (DiT) models.
  • DiT-Specific Components: Introduces specialized components for Diffusion Transformers, such as DiTCacheManager for optimized caching and OmniDiffusionScheduler for DiT-specific scheduling logic.
  • CLI and API Integration: Provides a command-line interface (CLI) wrapper that intercepts vLLM commands with an --omni flag, and a FastAPI-based API server for online inference.
  • vLLM Plugin System Integration: Registers vLLM-omni as a plugin within the vLLM ecosystem, allowing for seamless extension and overriding of vLLM's default behavior.
  • Enhanced Request Handling: Extends the base vLLM request object (OmniRequest) to support multimodal inputs, diffusion parameters, and track processing stages and intermediate outputs.
  • Output Processing: Implements a MultimodalOutputProcessor to handle and format diverse outputs, including text, images, and latent representations, from various model stages.
  • Development Environment Setup: Updates requirements.txt with necessary dependencies and development tools, and refactors conftest.py to establish a robust testing environment.


@gemini-code-assist Bot left a comment


Code Review

This pull request initializes the repository structure for vllm-omni, a multi-modal extension for vLLM. It includes extensive documentation covering product requirements, architecture, and testing design, along with skeleton code for the core components. The overall structure is well-thought-out and aligns with the project's goals.

My review focuses on identifying potential issues in the initial implementation. I've found a critical import error that will break the code, some incorrect logic in the cache manager and scheduler, and several typos in the documentation. I've also noted a dependency on a non-existent PyTorch version which will cause installation failures. Addressing these points will help build a more robust foundation for the project.

Comment thread vllm_omni/config/stage_config.py
Comment thread requirements.txt
Comment thread vllm_omni/core/dit_cache_manager.py
Comment on lines +150 to +159
def _create_seq_group_from_request(self, request: Dict[str, Any]) -> Any:
"""Create a sequence group from a DiT request."""
# This would create a proper sequence group
# For now, we'll return a mock implementation
from vllm.v1.core.sched.sequence import SequenceGroup

# Mock sequence group creation
# In practice, this would properly create a SequenceGroup
# with the appropriate metadata for DiT processing
return None

Severity: high

The _create_seq_group_from_request method currently returns None. This will cause a TypeError when the return value is used, for example, when it's appended to scheduled_seq_groups and then iterated over. This method should return a valid SequenceGroup object or a placeholder that doesn't break downstream logic.
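One way to avoid breaking downstream logic without depending on vLLM internals is a lightweight stand-in that carries only the fields the scheduler reads; this is a sketch under that assumption, not the project's actual fix, and the field names are illustrative:

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class DiTSeqGroupPlaceholder:
    # Minimal stand-in so scheduled_seq_groups can be appended to and
    # iterated over without raising; the field names are illustrative only.
    request_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def _create_seq_group_from_request(self, request: Dict[str, Any]) -> Any:
    """Create a placeholder sequence group from a DiT request."""
    return DiTSeqGroupPlaceholder(
        request_id=str(request.get("request_id", "")),
        metadata={k: v for k, v in request.items() if k != "request_id"},
    )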

Comment thread vllm_omni/core/sched/diffusion_scheduler.py Outdated
Comment thread docs/architecture/detailed arch design.md Outdated
prompt_str, engine_request, tokenization_kwargs = self._process_stage_inputs(stage_config, **stage_args)

# Add inputs to Engine
stage_engine.add_request(requesy_id, prompt_str, tokenization_kwargs)

Severity: medium

There is a typo in 'requesy_id'. It should be 'request_id'.

Suggested change:
-    stage_engine.add_request(requesy_id, prompt_str, tokenization_kwargs)
+    stage_engine.add_request(request_id, prompt_str, tokenization_kwargs)

Comment thread docs/architecture/detailed arch design.md Outdated
Comment thread docs/architecture/detailed arch design.md Outdated
Comment on lines +71 to +86
response_outputs = []
for output in outputs:
if hasattr(output, 'outputs') and output.outputs:
for out in output.outputs:
response_outputs.append({
"text": getattr(out, 'text', ''),
"finished": getattr(out, 'finish_reason', 'length') != 'length',
"tokens": getattr(out, 'token_ids', [])
})
else:
response_outputs.append({
"text": "",
"finished": True,
"tokens": []
})


Severity: medium

The response generation logic in the /generate endpoint seems to only handle text-based outputs. It extracts text, finish_reason, and token_ids from the RequestOutput. This is inconsistent with the project's goal of supporting multimodal outputs (like images), and the MultimodalOutputProcessor which is designed to produce outputs with image or latent data. The response model and logic should be updated to handle and serialize multimodal outputs correctly.
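A sketch of how the loop could branch on modality so image or latent payloads survive JSON serialization; this is a fragment of the endpoint handler, the image_data/latent attribute names are assumptions, and base64 is one possible encoding for binary payloads:

import base64

response_outputs = []
for output in outputs:
    for out in getattr(output, "outputs", None) or []:
        item = {
            "text": getattr(out, "text", ""),
            "finished": getattr(out, "finish_reason", None) is not None,
            "tokens": getattr(out, "token_ids", []),
        }
        # Assumed attribute names for non-text payloads produced by the
        # multimodal output processor.
        image_bytes = getattr(out, "image_data", None)
        if image_bytes is not None:
            item["image_base64"] = base64.b64encode(image_bytes).decode("ascii")
        latent = getattr(out, "latent", None)
        if latent is not None:
            item["latent_shape"] = list(getattr(latent, "shape", []))
        response_outputs.append(item)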


hsliuustc0106 and others added 4 commits September 26, 2025 15:47
- Improve API server with better error handling and response formatting
- Enhance CLI with additional options for DiT stages and configuration
- Add comprehensive examples in examples/basic/ including:
  - API client with health checks and text generation
  - Docker setup and usage examples
  - Simple usage patterns for different scenarios
- Add utility scripts for model downloading and Docker setup
- Update documentation with implementation details and testing guidelines
- Fix configuration validation issues in OmniLLM
- Improve stage configuration handling for AR and DiT stages
- Add proper error handling and fallback mechanisms

Tested with Qwen3-0.6B model:
- Server starts successfully on port 8000
- Health and info endpoints working correctly
- Text generation with various parameters functioning
- API client examples working as expected
- CLI help and configuration options working properly
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

- Move vllm_omni/core/omni_llm.py to vllm_omni/entrypoints/omni_llm.py
- Update all import statements across the codebase to reflect new location
- Fix relative imports within the moved file
- Maintain functionality while improving code organization
- All imports and functionality tested and working correctly

This change better reflects that OmniLLM and AsyncOmniLLM are the main entry points
for vLLM-omni functionality, rather than core implementation details.
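For downstream code, the relocation changes imports along these lines (a sketch; only the module paths named in this commit message are used):

# Old import path (before this change):
#   from vllm_omni.core.omni_llm import OmniLLM, AsyncOmniLLM
# New import path (after this change):
from vllm_omni.entrypoints.omni_llm import OmniLLM, AsyncOmniLLM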
- Add test_serving.sh: Full-featured testing suite with comprehensive validation
- Add quick_test.sh: Fast validation script for quick testing after changes
- Add scripts/README.md: Complete documentation for testing scripts
- Include health checks, text generation, performance testing, and API integration
- Add retry mechanisms and proper error handling
- Support for different models and ports
- Comprehensive logging and colored output
- Ready for CI/CD integration

Usage:
- Quick test: ./scripts/quick_test.sh [port]
- Full test: ./scripts/test_serving.sh [model_path] [port]

input_modalities=["text"],
output_modalities=["text"]
)
stage_configs.append(ar_config)

Bug: Unreachable Fallback in Stage Configuration

The if not stage_configs: condition is unreachable because an AR stage is always added to stage_configs earlier. This means the fallback logic, intended for when no specific stages are configured, will never execute.

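One way to make the fallback reachable is to add the AR stage only when the caller actually configured one; a sketch under that assumption, with OmniStageConfig arguments mirroring the snippet above and DEFAULT_TEXT_MODEL as a placeholder:

from vllm_omni.config.stage_config import OmniStageConfig

DEFAULT_TEXT_MODEL = "./models/Qwen3-0.6B"  # placeholder default

def build_stage_configs(ar_model=None, dit_model=None):
    """Build stage configs so the empty-config fallback is actually reachable."""
    stage_configs = []
    # Add the AR stage only when one was requested, instead of unconditionally.
    if ar_model is not None:
        stage_configs.append(
            OmniStageConfig(
                model=ar_model,
                input_modalities=["text"],
                output_modalities=["text"],
            )
        )
    if dit_model is not None:
        stage_configs.append(
            OmniStageConfig(
                model=dit_model,
                input_modalities=["text"],
                output_modalities=["image"],
            )
        )
    # Fallback: now reachable when the caller configured no stages at all.
    if not stage_configs:
        stage_configs.append(
            OmniStageConfig(
                model=DEFAULT_TEXT_MODEL,
                input_modalities=["text"],
                output_modalities=["text"],
            )
        )
    return stage_configs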

prompt_logprobs=None,
outputs=[mock_output],
finished=True
)

Bug: Async Class Calls Sync Method

The AsyncOmniLLM class isn't fully asynchronous. It inherits from LLM and its generate_async method calls the synchronous super().generate(), which blocks the event loop. Additionally, within _execute_stage_async, DiffusersPipelineEngine is initialized with parameters that don't align with its constructor signature.

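A common pattern for the event-loop issue is to push the blocking call onto a worker thread; the sketch below uses asyncio.to_thread and assumes the class keeps inheriting from vLLM's LLM, as noted above, so the real signatures may differ:

import asyncio
from vllm import LLM

class AsyncOmniLLM(LLM):
    async def generate_async(self, prompts, sampling_params=None, **kwargs):
        # Off-load the blocking, synchronous generate() to a worker thread so
        # awaiting callers do not stall the event loop while stages execute.
        return await asyncio.to_thread(
            super().generate, prompts, sampling_params, **kwargs
        )

This is only a stopgap for responsiveness; a fully asynchronous implementation would also need the DiffusersPipelineEngine constructor mismatch fixed, as the comment notes.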

@hsliuustc0106 changed the title from "initilization of main repo structure" to "init main repo structure and demonstrate the AR + DiT demo for omni models" on Sep 30, 2025
@hsliuustc0106 merged commit 5a503f3 into main on Sep 30, 2025
1 check passed
R2-Y pushed a commit to R2-Y/vllm-omni that referenced this pull request Jan 15, 2026
correct _thinker_to_talker_prefill to handle multiple segments inside one chunk
yinpeiqi referenced this pull request in yinpeiqi/vllm-omni Mar 12, 2026
Sy0307 added a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
P0 fixes:
  vllm-project#1: _free_scaffold_weights now shrinks storage to zero (actually
      releases VRAM). Only runs when SKIP_SCAFFOLD is also set.
      Called lazily after first prefill, not at load time.
  vllm-project#2: Sliding VAE default OFF (splice algorithm had alignment bug).
      _sliding_vae_decode now falls back to full decode until proper
      overlap-add is implemented.
  vllm-project#3: Complete per-request state reset in preprocess: now clears
      _curr_prefix_feat_cond, _last_audio_patch_gpu, _prev_audio,
      _prev_audio_len, _decode_step_count, _precomputed_stop_logits.
  vllm-project#4: compute_logits fallback forces stop (not continue) when
      _prefill_completed=True, preventing runaway generation.
  vllm-project#5: Scaffold VRAM: load_weights no longer frees immediately;
      _free_scaffold_weights called after first prefill completes,
      so scaffold is available for prefill then released.

P1 fixes:
  vllm-project#6: Log all active config flags at load time.
  vllm-project#7: Remove dead _STOP_CHECK_INTERVAL code.
  vllm-project#8: Remove broken audio_duration formula from postprocess.
  vllm-project#9/vllm-project#14: Move `from einops import rearrange` to module top level.
  vllm-project#11: Remove torch.no_grad() context from _forward_decode_graphable
       (incompatible with CUDA Graph capture).
